Abstract 

In this work, we present a minimum entropy analysis scheme for variable selection and 
preliminary data analysis. Variable selection can be achieved by ranking variables in order of 
increasing preference. We show that such a preference has a unique form, which is given by the 
entropy of the models associated with the variables. Evaluating the entropy provides a complete 
ranking scheme of variables. This scheme not only indicates preferred variables but may also 
reveal the system's nature and properties. We illustrate the proposed scheme by analyzing a set 
of geological data for three carbonate rock units in Texas and Oklahoma, and compare it to 
discriminant function analysis. The result suggests that this scheme provides a quick and robust 
analysis, and that its use in data analysis is promising. 

PACS numbers: 02.50.-r, 02.50.Le, 02.50.Sk, 89.70.+C 

1 Introduction 

When one investigates an unknown system on the basis of experimental observations made on this 
system, two questions are commonly addressed. The first is: how does one model the system? The model 
associates the experimental observations with the corresponding experimental responses. Unfortunately, 
there is no systematic method to answer this question. It is usually resolved through trial and 
error, empirical regressions, and some intuitive assumptions. The second question is: supposing 
a model is found, which experimental observations should be codified into the model? This is 
exactly a variable selection question. Here we put the first question aside and focus on resolving 
the second. 

A problem similar to variable selection, the model selection problem, has been tackled through 
different approaches in the past. Examples include the P-value method, the Bayesian approach, and 
the Kullback-Leibler distance based approach. The P-value method selects a model by comparing the 
probability of the model, given a null model and experimental data sets, to a threshold value 
assessed from the same data sets. Yet since this method is restricted to two models and requires 
ad hoc rules to assess the threshold value, Bayesian approaches have been developed to overcome 
these defects ([2]). The Bayesian method applies Bayes' theorem to update our beliefs and 
uncertainty about models, starting from prior distributions generated from some prior modeling rules. 
A preferable model is thereafter chosen according to the Bayes factor, the ratio of posterior 
distributions of different models. The Bayesian Information Criterion (BIC) is one of the most 
popular Bayesian-based model selection criteria (see the works of Kieseppa and of Forbes and 
Peyrard). Yet all of these methods require prior information generated from ad hoc prior modeling 
rules that suit people's needs. 

Aside from the Bayesian framework, approaches based on relative entropy, mutual information, or 
Kullback-Leibler distance have been developed ([2]), as has An Information Criterion (AIC) of Akaike. 
The rationale is to design a criterion based on entropy. The reason for employing relative entropy 
as a selection criterion is discussed in [5]. Recently, another criterion for model selection, CIC, 
was proposed by Rodriguez based on information geometry [7]. It is a generalized version of AIC 
and BIC. The preferred model is selected to minimize a quantity derived from Bayes' theorem. 

In the case of variable selection, Dupuis and Robert proposed modeling the system with 
different combinations of experimental observations, that is, variables selected from a set of 
variables. This generates several models associated with different combinations of variables, so 
the variable selection problem becomes a model selection problem. One then evaluates the 
Kullback-Leibler distance between the full model, described by a complete set of variables for the 
system of interest, and its approximations, the submodels, described by subsets of variables. When 
the full model is tractable, the preferred submodel is selected when its Kullback-Leibler distance 
reaches a threshold value. This threshold value is usually estimated from experience with 
experimental data. Since submodels are projections of the full model, no prior modeling rule is 
needed to generate a prior distribution for each submodel. Yet one still requires prior information 
on the full model. In addition, when there is no complete set of variables, namely, when only 
limited experimental measurements of the system can be conducted, this strategy becomes inadequate. 

Our goal is to apply the method of maximum entropy (ME) to design a tool for variable selection 
that is free from the defects in Dupuis and Robert's approach. Following the axiomatic approach 
proposed in developing the method of ME into a tool for assigning a probability to a system [S] and 
a tool for updating probability [H], Tseng has shown an entropic criterion for model selection [S]. 
Based on this study, we generalize it to obtain a Minimum Entropy Analysis (MEA) in Sec. 2. The 
proposed analysis tool provides a complete ranking scheme of variables. It not only allows one to 
select preferred variables but also suggests an analysis scheme to reveal the nature and properties 
of the system. To illustrate the MEA scheme, we study a binary system in geology in Sec. 3. Some 
discussion follows. Our demonstration shows the MEA scheme to be a promising tool in data 
analysis. Finally, some conclusions are given. 

2 The minimum entropy analysis 
2.1 Basic features 

Besides designing a pertinent criterion for selection, selection can also be achieved by ranking 
variables in order of increasing preference. Since the properties and meaning of the variables, 
the experimental observations, are sometimes quite different, it is meaningless to compare 
variables directly. For example, suppose two experimental observations, mass and area, are 
measured for studying a system. How does one evaluate the weightings of these two quantities to 
determine the dominant quantity in the model given to study this system? Namely, what is the 
preference of these variables? In their work on the variable selection problem, Dupuis and Robert 
proposed codifying the variables by a specific model, such as the logit model for a linear binary 
system. Variables are afterward ranked by evaluating the Kullback-Leibler distance between the 
model described by the complete set of variables and its projections, the submodels, described by 
subsets of variables. Yet when there is no complete set of variables, namely, when experimental 
measurements only provide limited information regarding the system, this strategy becomes 
inadequate. 

The approach for model selection proposed by Tseng may spell out a resolution of the variable 
selection problem. It states that, supposing a family of probability models $\{P^m(x)\}$ is found 
to interpret the system, where $m$ labels the model and $x$ denotes states of the model, the 
preference of models is uniquely determined and takes the form of the relative entropy of model 
$P^m(x)$ and a uniform reference measure $\mu$, 

$$S[P^m|\mu] = -\sum_x P^m(x) \ln \frac{P^m(x)}{\mu(x)} . \qquad (1)$$

This scalar value measures the difference between a model and the uniform reference measure. The 
relative entropy is maximized when $P^m(x)$ equals $\mu$, namely, when no information regarding the 
system is codified into $P^m(x)$. Within the family, as the relative entropy decreases, more 
information about the system is codified into $P^m(x)$. One can then rank the probability 
models according to decreasing $S[P^m|\mu]$ value. 
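
The ranking by relative entropy can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the three four-state models and their names are hypothetical, and the reference measure is assumed to be the normalized uniform distribution.

```python
import math

def relative_entropy(p, mu):
    """S[P|mu] = -sum_x P(x) ln(P(x)/mu(x)); states with P(x) = 0 contribute 0."""
    return -sum(pi * math.log(pi / mi) for pi, mi in zip(p, mu) if pi > 0)

# Hypothetical family of models over four states (illustrative numbers only).
models = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "mild":    [0.40, 0.30, 0.20, 0.10],
    "sharp":   [0.85, 0.05, 0.05, 0.05],
}
mu = [0.25] * 4  # assumed normalized uniform reference measure

scores = {name: relative_entropy(p, mu) for name, p in models.items()}

# Increasing preference corresponds to decreasing relative entropy,
# so the most preferred model comes first in this list.
ranking = sorted(scores, key=scores.get)
```

In this toy family the model equal to the reference measure attains the maximum value, zero, and therefore ranks last, while the most sharply peaked model ranks first.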



2.2 Logic behind the MEA 

Based on Dupuis and Robert's approach, one can determine the preference of variables by first 
codifying these variables into a model. This model can be any function that optimally associates 
the experimental observations, the variables, with the responses. According to Tseng's work on 
model selection [S], one needs to further codify this model into a probability distribution of 
observing the experimental responses given the variables in order to compute the preference. Thus 
the preference of variables is determined from the preference of those probability distributions. 

Based on these aspects, the logic behind the proposed MEA scheme involves two stages. 
The first stage is to determine a probability model that associates experimental observations and 
responses. The method of ME proposed by Jaynes [S] provides a solution. Since the method of ME 
requires the information to be codified into the probability distribution to be in the form of 
constraints, it turns the question of probability assignment into a question of what the 
constraints are. We will come back to this point later. 

Next we follow the axiomatic approach [H] to determine the form of the preference of the 
probability distributions. The basic strategy (Skilling of [H]) is one of induction: (1) if a 
general rule exists, then it must apply to special cases; (2) if in a certain special case we know 
which is the best approximation, then this knowledge can be used to constrain the form of the 
preference; and finally, (3) if enough special cases are known, then the preference will be 
completely determined. The known special cases are called the "axioms" of ME. The axioms used here 
must reflect the conviction that one should not change one's mind frivolously, that whatever 
information was originally codified into the exact probability distribution is important and 
should be preserved. As shown in [S], three axioms are employed: (1) local information has local 
effects; (2) the ranking should not depend on the coordinates of the system; and (3) consistency 
for independent subsystems. The functional form of the preference is therefore uniquely 
determined and has the form of relative entropy, Eq. (1). 
Please refer to Caticha for a detailed proof of the axiomatic approach. 



2.3 The MEA scheme 

Suppose a model, a function $f(x, \beta)$, is given to associate $N$ experimental observations, 
denoted by variables $x = \{x_1, x_2, \ldots, x_N\}$, and parameters 
$\beta = \{\beta_1, \beta_2, \ldots, \beta_N\}$ with experimental responses $M(x, \beta)$. Each 
observation is repeated $I$ times, which gives $I$ measurements 
$x_i = \{x_i^1, x_i^2, \ldots, x_i^I\}$ and corresponding responses 
$M(x, \beta) = \{M^1, M^2, \ldots, M^I\}$. The response $M$ is either "1" for a positive response 
or "0" for a negative response. Note that one way to determine $\beta$ is through the method of 
Maximum Likelihood Estimate (MLE). For example, the logit model 

$$f_{\rm logit}(x, \beta) = \frac{\exp\left(\sum_{i=1}^{N} \beta_i x_i\right)}{\exp\left(\sum_{i=1}^{N} \beta_i x_i\right) + 1} \qquad (2)$$

is usually given as a regression model for a binary output system. Given these $N$ variables, there 
are $2^N - 2$ different combinations (subsets) of variables $x_{s_i} \subset x$ that can be chosen 
from $x$, and these form $2^N - 2$ submodels $f(x_{s_i}, \beta_{s_i})$. 
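
The count of submodels can be checked by direct enumeration. The sketch below assumes that the two excluded combinations are the empty set and the full set of variables; the function name `proper_subsets` is hypothetical and introduced only for illustration.

```python
from itertools import combinations

def proper_subsets(variables):
    """All non-empty subsets that leave out at least one variable."""
    n = len(variables)
    return [c for k in range(1, n) for c in combinations(variables, k)]

# Six variables, as in the geological example studied later in the paper.
compounds = ["HCO3", "SO4", "Cl", "Ca", "Mg", "Na"]
subsets = proper_subsets(compounds)

# Under the stated assumption the count is 2**N - 2.
n_submodels = len(subsets)
```

For six variables this yields 62 subsets, each of which defines one submodel to be fitted and ranked.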
Afterward we have to determine the probability $P_s(x_{s_i}|\beta_{s_i})$ of observing the 
experimental responses $M$ given the subset of variables $x_{s_i}$. In the framework of the method 
of ME, one has to identify the relevant information to be codified into the probability 
distribution. In this example, it is obvious that the experimental responses are the quantity we 
need to know about the system, which can be written in the form of the constraint 

$$\langle M \rangle = \sum_{x_{s_i}} f(x_{s_i}, \beta_{s_i}) P_s(x_{s_i}|\beta_{s_i}) , \qquad (3)$$

where $\langle M \rangle$ is the expectation value of the responses. Thus the method of ME 
indicates the preferred $P_s(x_{s_i}|\beta_{s_i})$ to be 

$$P_s(x_{s_i}|\beta_{s_i}) = \frac{1}{Z} \exp\left(-\lambda f(x_{s_i}, \beta_{s_i})\right) , \qquad (4)$$

where the partition function $Z = \sum_{x_{s_i}} \exp\left(-\lambda f(x_{s_i}, \beta_{s_i})\right)$ 
and $\lambda$ is a Lagrange multiplier, which is set to one for the sake of simplicity. 
Alternatively, the probability distribution can be obtained by simply normalizing the submodels 
$f(x_{s_i}, \beta_{s_i})$, 

$$P'_s(x_{s_i}|\beta_{s_i}) = \frac{1}{Z'} f(x_{s_i}, \beta_{s_i}) , \qquad (5)$$

where the normalization constant $Z' = \sum_{x_{s_i}} f(x_{s_i}, \beta_{s_i})$. Note that this form 
of probability distribution is actually just a first-order approximation of Eq. (4). 
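
The two ways of turning a fitted submodel into a probability distribution can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the sums are taken over the individual measurements, the multiplier is set to one, and the measurements and coefficients are hypothetical numbers, not values fitted by MLE.

```python
import math

def logit(x, beta):
    """Logit model: exp(z) / (exp(z) + 1) with z = sum_i beta_i * x_i."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically the same expression

def me_distribution(f_values, lam=1.0):
    """Maximum-entropy form: exp(-lam * f) / Z over the measurements."""
    weights = [math.exp(-lam * f) for f in f_values]
    z_part = sum(weights)
    return [w / z_part for w in weights]

def normalized_distribution(f_values):
    """Normalized-model form: f / Z' over the measurements."""
    z_norm = sum(f_values)
    return [f / z_norm for f in f_values]

def entropy(p):
    """S[P] = -sum P ln P."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical measurements of two variables and hypothetical coefficients.
measurements = [(0.2, 1.5), (1.1, 0.3), (2.4, 2.0), (0.7, 0.9)]
beta = (0.8, -0.5)

f_values = [logit(x, beta) for x in measurements]
p_me = me_distribution(f_values)
p_norm = normalized_distribution(f_values)
s_me, s_norm = entropy(p_me), entropy(p_norm)
```

If the sums run over the $I$ measurements, as assumed here, both entropies are bounded above by $\ln I$, the entropy of the uniform distribution over the measurements.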

Thus the increasing ranking order of these submodels $f(x_{s_i}, \beta_{s_i})$ is given by the 
decreasing relative entropy, Eq. (1), with $P^m$ replaced by $P_s(x_{s_i}|\beta_{s_i})$ or 
$P'_s(x_{s_i}|\beta_{s_i})$. Furthermore, one can easily rewrite Eq. (1) as 

$$S[P_s|\mu] = S[P_s] + \ln\mu , \qquad (6)$$

where $S[P_s] = -\sum_{x_{s_i}} P_s(x_{s_i}|\beta_{s_i}) \ln P_s(x_{s_i}|\beta_{s_i})$ is the 
entropy of the submodel. Because $\ln\mu$ is a constant for a uniform reference measure, the 
preference of submodels is equivalent to the decreasing $S[P_s]$ value. 
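
The step from the relative entropy to the sum of the submodel entropy and $\ln\mu$ can be spelled out in one line of algebra. The derivation below is a sketch added for clarity; it assumes only that $\mu$ is constant over the states and that $P_s$ is normalized, and it abbreviates $P_s(x_{s_i}|\beta_{s_i})$ as $P_s$.

```latex
\begin{align*}
S[P_s|\mu] &= -\sum_{x_{s_i}} P_s \ln\frac{P_s}{\mu} \\
           &= -\sum_{x_{s_i}} P_s \ln P_s \;+\; \ln\mu \sum_{x_{s_i}} P_s \\
           &= S[P_s] + \ln\mu ,
\end{align*}
% the last step uses the normalization \sum_{x_{s_i}} P_s = 1.
```

Since the second term is the same for every submodel, ranking by relative entropy and ranking by entropy give the same order.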

By evaluating the entropy of all submodels, a complete ranking scheme of the different subsets of 
variables $x_{s_i}$ is determined. The preferred subset of variables can then be identified as the 
one that has the minimum entropy value within this set of variables. Notice that the use of this 
scheme is not yet exhausted. By further analyzing the ranking scheme of the different subsets 
$x_{s_i}$, one may determine the significance of the different combinations of variables that are 
codified into the submodels. The nature and properties of the system may thereafter be revealed 
through this analysis. For example, correlation functions between two variables arbitrarily chosen 
within $x$ may reveal some properties of the system. Although we do not compute the correlation 
functions in this scheme, the preference of different correlations is still implicitly spelled out 
by the ranking scheme of the different subsets. One can attribute this to the fact that, when the 
model is given to associate the variables and responses, the correlations between the variables 
are defined in the model. Thus determining the ranking scheme of different combinations of 
variables indirectly indicates the significance of different correlations. 

Thus one may treat the MEA scheme as a quick data analysis tool that provides preliminary 
information about the system. This use is implicit in some of the other approaches mentioned 
previously. We illustrate the use of the MEA scheme in detail by studying a geological problem next. 
3 A special binary system in Geology 



3.1 The problem 

We consider a geological example of sample classification (Davis [11]). We first briefly address 
the result obtained by means of a standard tool, the Discriminant Function Analysis (DFA), for 
classifying geological samples. Then we show how to extract the important variables in determining 
the category of samples using our minimum entropy analysis. Furthermore, we extract some 
information regarding the formation of the sample rocks. Comparing the results from the DFA and 
the MEA improves our understanding and enhances our confidence in our MEA scheme. 

Saltwater is trapped in sedimentary rocks at the time they are formed in the marine environment. 
The chemical composition of the connate water is subsequently modified by ion exchange and 
other reactions, by mixing with other brines, and by dilution by infiltrating surface waters. Brines 
recovered during drillstem tests of wells may have relict compositional characteristics that provide 
clues to the origin or depositional environment of their source rocks. Table I contains brine analyses 
for oil-field waters from three groups of carbonate units in Texas and Oklahoma (Davis [11]). 
The first column in Table I denotes whether each brine sample belongs to a specific carbonate 
unit, Unit G. 

3.2 The discriminant function analysis and results 

The discriminant function analysis combines a rationale similar to that of analysis of data variance 
with computational procedures based on eigenvector calculations, e.g. PCA (principal component 
analysis). Multivariate measurements made on the samples alone, such as the brine data in Table I, 
can be used in the DFA to find combinations of measurements that allow the various categories of 
samples to be distinguished. The problem of DFA is basically one of finding a set of linear weights 
for the variables that causes a multivariate analogue of the F-ratio to be a maximum. A succession 
of discriminant functions, along which the samples are as distinct as possible, can thus be 
calculated, and each successively represents the most efficient discriminator possible. For 
calculation details, please refer to the book of Davis. 

The DFA can be applied to the data in Table I to determine whether they are distinctive. The 
first discriminant function thus calculated is the inner product 
$(-0.3765, -0.0468, 0.0112, -0.0148, -0.0174, -0.0110) \cdot (\mathrm{HCO_3}, \mathrm{SO_4}, \mathrm{Cl}, \mathrm{Ca}, \mathrm{Mg}, \mathrm{Na})^T$, 
which clearly separates samples from Unit G from those of the other units. Note that the weighting 
factors in the first discriminant function for the variables HCO3 and SO4, i.e. -0.3765 and 
-0.0468, are the two largest in magnitude among the six, thus indicating that these two variables 
play the most dominant role in the classification. 
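
As a small illustration of how such a discriminant function is applied, the sketch below evaluates the quoted inner product for four rows copied from Table I. It assumes the raw concentrations are used without centering or an offset, which may differ from the exact procedure in Davis; the function name `discriminant_score` is hypothetical.

```python
# Weights of the first discriminant function quoted in the text,
# in the order (HCO3, SO4, Cl, Ca, Mg, Na).
weights = (-0.3765, -0.0468, 0.0112, -0.0148, -0.0174, -0.0110)

def discriminant_score(sample, w=weights):
    """Inner product of the weight vector with one brine analysis (ppm)."""
    return sum(wi * xi for wi, xi in zip(w, sample))

# Four rows copied from Table I: two marked "N" and two marked "Y".
not_unit_g = [
    (10.4, 30.0, 967.1, 95.9, 53.7, 857.7),
    (6.3, 16.7, 360.9, 41.9, 24.0, 318.1),
]
unit_g = [
    (12.0, 104.6, 3163.8, 95.6, 90.1, 3093.9),
    (14.1, 80.1, 554.8, 118.9, 62.3, 472.4),
]

scores_n = [discriminant_score(s) for s in not_unit_g]
scores_y = [discriminant_score(s) for s in unit_g]
```

For these four rows the Unit G samples receive lower scores than the other samples, in line with the separation described above.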

3.3 The minimum entropy analysis 

According to conventional studies ([4] and [10]), one can pertinently associate the six 
experimental observations with binary responses in this geological example through the logit 
model, Eq. (2). The variables $x_6 = \{x_1, x_2, \ldots, x_6\}$ denote observations of the contents 
of the six chemical compounds {HCO3, SO4, Cl, Ca, Mg, Na} correspondingly. As shown in Table I, 
nineteen measurements were made. There are five positive experimental responses, denoted by the 
symbol "Y", and fourteen negative responses. The method of ME suggests that the preferred 
probability distribution of observing the experimental responses given the variables $x_6$ is 

$$P(x_6|\beta) = \frac{1}{Z} \exp\left(-f_{\rm logit}(x_6, \beta)\right) , \qquad (7)$$
Table I: Chemical analyses of brines (in ppm) recovered from drillstem tests of three carbonate rock units in Texas and 
Oklahoma. Adapted from Davis [11]. 

Unit G | HCO3 | SO4   | Cl     | Ca    | Mg    | Na
-------+------+-------+--------+-------+-------+-------
N      | 10.4 | 30    | 967.1  | 95.9  | 53.7  | 857.7
N      | 6.2  | 29.6  | 1174.9 | 111.7 | 43.9  | 1054.7
N      | 2.1  | 11.4  | 2387.1 | 348.3 | 119.3 | 1932.4
N      | 8.5  | 22.5  | 2186.1 | 339.6 | 73.6  | 1803.4
N      | 6.7  | 32.8  | 2015.5 | 287.6 | 75.1  | 1691.8
N      | 3.8  | 18.9  | 2175.8 | 340.4 | 63.8  | 1793.9
N      | 1.5  | 16.5  | 2367   | 412   | 95.8  | 1872.5
Y      | 25.6 |       | 134.7  | 12.7  | 7.1   | 134.7
Y      | 12   | 104.6 | 3163.8 | 95.6  | 90.1  | 3093.9
Y      | 9    | 104   | 1342.6 | 104.9 | 160.2 | 1190.1
Y      | 13.7 | 103.3 | 2151.6 | 103.7 | 70    | 2054.6
Y      | 16.6 | 92.3  | 905.1  | 91.5  | 50.9  | 871.4
Y      | 14.1 | 80.1  | 554.8  | 118.9 | 62.3  | 472.4
N      | 1.3  | 10.4  | 3399.5 | 532.3 | 235.6 | 2642.5
N      | 3.6  | 5.2   | 974.5  | 147.5 | 69    | 768.1
N      | 0.8  | 9.8   | 1430.2 | 295.7 | 118.4 | 1027.1
N      | 1.8  | 25.6  | 183.2  | 35.4  | 13.5  | 161.5
N      | 8.8  | 3.4   | 289.9  | 32.8  | 22.4  | 225.2
N      | 6.3  | 16.7  | 360.9  | 41.9  | 24    | 318.1


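where the partition function $Z = \sum_{x_6} \exp\left(-f_{\rm logit}(x_6, \beta)\right)$. 
Alternatively, normalizing the logit model within this data set gives 

$$P'(x_6|\beta) = \frac{1}{Z'} \, \frac{\exp\left(\sum_{i=1}^{6} \beta_i x_i\right)}{\exp\left(\sum_{i=1}^{6} \beta_i x_i\right) + 1} , \qquad (8)$$

where the normalization constant is 

$$Z' = \sum_{x_6} \frac{\exp\left(\sum_{i=1}^{6} \beta_i x_i\right)}{\exp\left(\sum_{i=1}^{6} \beta_i x_i\right) + 1} . \qquad (9)$$

Given these six variables, $2^6 - 2 = 62$ different combinations of variables $x_{s_i} \subset x_6$ 
are obtained. Thereafter, one can generate 62 probability submodels 
$P_{s_i}(x_{s_i}) = P(x_{s_i}|\beta_{s_i})$ or $P'(x_{s_i}|\beta_{s_i})$. 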

Evaluating the entropy of $P_{s_i}(x_{s_i})$, Eq. (6), with different subsets of variables 
$x_{s_i}$ gives the ranking order of the different submodels $P_{s_i}(x_{s_i})$. The coefficients 
$\beta_i$ are determined by fitting the logit model to the experimental measurements by the MLE 
method (a MATLAB function for "Maximum likelihood estimation and model criticism" from the 
appendix on software for ordinal data modeling). The result is listed in Table II. Here we list 
only 18 of the 62 submodels. We calculate the entropy of the two probability distributions, 
Eq. (7) and Eq. (8), in the second and third columns respectively. The ranking scheme is in order 
of decreasing entropy value. To analyze this ranking scheme, we proceed with a two-step approach. 
In the first step, we examine the submodels that have the minimum entropy value. In this example, 
16 of the 62 submodels have the minimum entropy value, 2.866 or 1.791 in the second and third 
columns of Table II respectively. Note that since the minimum number of significant figures in the 
experimental data in Table I is three, the entropy value should also have three significant 
figures, and the fourth digit is just an estimate. The preferences of these 16 submodels are 
therefore indistinguishable. Note that the digits in round brackets show the numerical result when 
significant figures are not considered. They just indicate that if the number of significant 
figures were higher, the resolution of the entropy would be better, and the preference among these 
16 submodels could then still be identified. That would further aid the detailed analysis of 
preference. 

In order to determine the most dominant variables from these 16 submodels, we record the 
frequencies with which the six variables appear in them. The first and second variables appear 
16 and 15 times respectively, and each of the remaining variables appears 8 times. This 
result suggests that the ability of the logit model to interpret the experimental measurements 
is strongly dominated by the first variable, HCO3, and the second variable, SO4. Variables 3 
to 6 seem to play a minor role here. This is exactly the result obtained through the DFA 
mentioned previously. Yet the MEA scheme is more straightforward. 
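
The frequency count can be reproduced with a few lines of code. The sketch below is only an illustration: the sixteen bit strings are transcribed from the rows of Table II that share the minimum entropy value, with "1" marking an included variable in the order (HCO3, SO4, Cl, Ca, Mg, Na), and the variable names in the code are hypothetical.

```python
compounds = ["HCO3", "SO4", "Cl", "Ca", "Mg", "Na"]

# The 16 minimum-entropy submodels, as transcribed from Table II.
min_entropy_submodels = [
    "110000", "110001", "111000", "110100",
    "101111", "111100", "110101", "110010",
    "111001", "111010", "110011", "110110",
    "111110", "110111", "111101", "111011",
]

# Count how often each variable is included across the submodels.
frequency = {
    name: sum(bits[i] == "1" for bits in min_entropy_submodels)
    for i, name in enumerate(compounds)
}
```

With this transcription the counts are 16 for HCO3, 15 for SO4, and 8 for each of Cl, Ca, Mg, and Na, matching the frequencies stated above.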

In the second step, we analyze the ranking scheme further to identify the preference between the 
first and second variables. Since variables 3 to 6 play a minor role here, we concentrate on the 
first two variables. We list two more submodels in Table II that include only the first variable, 
HCO3, and only the second variable, SO4, respectively. The entropy value in the third column shows 
a dramatic change, from 2.328 for submodel {010000}, which has only the second variable, through 
2.060 for {100000}, which has only the first variable, to 1.791 for {110000}, in which both 
variables are included. The same trend is also observed in the second column, although the change 
is not dramatic. The ranking scheme indicates that the first variable, HCO3, should play a more 
important role than the second variable, SO4, in the model. 

Table II: The ranking scheme of six chemical compounds. The first column represents the six chemical compounds. The 
number "1" denotes that the corresponding variable in the first row is considered and "0" denotes that it is neglected. 
The second column presents the entropy value, Eq. (6), of the probability distribution $P_{s_i}(x_{s_i})$ given by 
Eq. (7), and the third column is the entropy value of $P'_{s_i}(x_{s_i})$ given by Eq. (8). Each row represents a 
submodel. Only 18 submodels are listed. 

HCO3 | SO4 | Cl | Ca | Mg | Na | S[Ps]         | S[P's]
-----+-----+----+----+----+----+---------------+---------------
0    | 1   | 0  | 0  | 0  | 0  | 2.893(229075) | 2.328(713745)
1    | 0   | 0  | 0  | 0  | 0  | 2.881(069331) | 2.060(483312)
1    | 1   | 0  | 0  | 0  | 0  | 2.866(92132)  | 1.791(857378)
1    | 1   | 0  | 0  | 0  | 1  | 2.866(921309) | 1.791(85668)
1    | 1   | 1  | 0  | 0  | 0  | 2.866(921264) | 1.791(854715)
1    | 1   | 0  | 1  | 0  | 0  | 2.866(921168) | 1.791(849677)
1    | 0   | 1  | 1  | 1  | 1  | 2.866(921139) | 1.791(848879)
1    | 1   | 1  | 1  | 0  | 0  | 2.866(921109) | 1.791(84701)
1    | 1   | 0  | 1  | 0  | 1  | 2.866(921101) | 1.791(846531)
1    | 1   | 0  | 0  | 1  | 0  | 2.866(921112) | 1.791(843955)
1    | 1   | 1  | 0  | 0  | 1  | 2.866(921028) | 1.791(842415)
1    | 1   | 1  | 0  | 1  | 0  | 2.866(921084) | 1.791(842147)
1    | 1   | 0  | 0  | 1  | 1  | 2.866(921072) | 1.791(841769)
1    | 1   | 0  | 1  | 1  | 0  | 2.866(921005) | 1.791(840749)
1    | 1   | 1  | 1  | 1  | 0  | 2.866(920963) | 1.791(83836)
1    | 1   | 0  | 1  | 1  | 1  | 2.866(920964) | 1.791(838358)
1    | 1   | 1  | 1  | 0  | 1  | 2.866(920971) | 1.791(838291)
1    | 1   | 1  | 0  | 1  | 1  | 2.866(920965) | 1.791(838227)



4 Discussions 

Lithologically speaking, Unit G is mainly composed of dolomite (CaMg(CO3)2) and anhydrite 
(CaSO4). In ancient geological times, Unit G, which in geology is called the "Grayburg Dolomite" 
[11], experienced two important sedimentary processes: dolomitization, which is associated with 
the dissolution of calcite by acidic fluids, and evaporation. Anhydrite is one of the index products 
of evaporation. Chalcraft and Ward further claimed that the principal diagenetic processes 
include dolomitization, anhydrite occlusion of primary porosity, and leaching [12]. Dolomitization 
plays a crucial role in the formation of Unit G and is followed by anhydrite occlusion of primary 
porosity. One can therefore infer that the dissolution of calcite by acidic fluids is more 
significant than anhydrite occlusion in the formation of Unit G. Namely, the process involving the 
chemical compound HCO3 is the most important one among the six compounds. 

In analyzing a set of geological data to seek out the origin or depositional environment of their 
source rocks, the MEA suggests that HCO3 and SO4 are the two key variables in the formation 
of Unit G. The MEA also suggests that HCO3 plays a more important role than SO4. Therefore, we 
can conclude that the formation of Unit G may strongly involve the chemical process associated with 
HCO3, while the chemical process associated with SO4 may play a minor role in the formation. 
This is exactly the result inferred above, but the MEA is more straightforward. Similarly, 
one may conduct further analysis to extract more information, yet that is beyond our scope here. 

5 Conclusions 

The minimum entropy analysis scheme is proposed to analyze experimental data in order to extract 
information about the corresponding system, such as which experimental observations play a more 
important role. This is a question of variable selection, and it can be resolved by determining 
the preference of these observations. To determine the preference, one first associates the 
experimental responses with those observations through a probability model. Thereafter, as shown 
in the text, the form of the preference is uniquely determined through the axiomatic approach. It 
is the entropy of the probability of observing the experimental responses given the variables. The 
preferred variables are those that have the minimum entropy value. Furthermore, since the minimum 
entropy analysis presents a complete ranking scheme of the different combinations of experimental 
observations, it indirectly indicates the significance of the different combinations of variables 
in the model. This ranking scheme not only suggests the preferred variables that should be 
codified into the model but may also spell out a route to study the system. Besides, this design 
resolves the two defects in Dupuis and Robert's approach mentioned previously. 

We have illustrated the use of the minimum entropy analysis by analyzing a set of geological 
data for three carbonate rock units in Texas and Oklahoma. The MEA scheme indicates that the 
preferred variables most relevant to the formation of Unit G, or Grayburg Dolomite, are HCO3 and 
SO4. This result agrees with the result from another well-known analysis tool, the discriminant 
function analysis. Furthermore, since the MEA presents a complete ranking scheme of the six chemical 
compounds measured in the samples, it points to a principal diagenetic process identified in [12]. 
Yet this conclusion is not clear in the discriminant function analysis. 